Kernel handshaking pattern proposal #66
Conversation
FWIW - this is what Enterprise Gateway does. However, I'm in the process of updating it to use a single response address to which the kernel (launchers) respond, rather than having a different response address for each kernel invocation. The reason for the update is that by having a single response address across all invocations, it introduces the possibility of having the server in one cluster and the kernel in another (e.g., server running in Kubernetes, while the kernel is running in a Hadoop YARN cluster), because the (single) response address can be published.

BTW, although "local" kernels in EG do not use this approach, we have recommended the configuration of "distributed" kernels in loopback just to work around the timing issues you are trying to address, since distributed kernels will go through this machinery despite the fact that they are indeed local to the server. The other major difference is that EG utilizes kernel launchers which (preferably) embed the target kernel so as to avoid any kernel modifications to the dozens of kernel implementations.

Since this proposal requires updates to the kernels themselves, one suggestion I would make is to not overload the use of the connection file parameter and introduce a separate, and explicit, parameter that is mutually exclusive. I think that would help in troubleshooting and maintenance. I would also suggest that the server (client in the above phrasing) be the thing that produces the connection file (following receipt of the port information via the socket), not the kernel, since the latter (incorrectly) assumes that kernels are always local.
It's the same in this proposal: the client opens a single socket that will be used by all launched kernels to communicate their port information.
I'm not sure that we can specify mutually exclusive arguments in the kernelspec.
Indeed, that would definitely be better, thanks for the suggestion!
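To illustrate the single-socket design mentioned above, here is a minimal client-side sketch. It is only an assumption about how an implementation could look (pyzmq is assumed; `wait_for_registration` and the JSON message shape are hypothetical, not part of the proposal):

```python
import json
import zmq

context = zmq.Context.instance()

# One REP socket, bound once; every launched kernel connects to it to register.
registration_socket = context.socket(zmq.REP)
registration_port = registration_socket.bind_to_random_port("tcp://127.0.0.1")

def wait_for_registration():
    # Blocks until some kernel sends its connection information.
    info = json.loads(registration_socket.recv())
    registration_socket.send(b'{"status": "ok"}')  # acknowledge receipt
    return info  # e.g. {"kernel_token": ..., "shell_port": ..., "iopub_port": ...}
```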
Thanks for the response Johan - glad to see this is a single response address proposal as well.
Ok. This overload approach probably fits better with backward-compatibility support.
Should we just let the kernel vendors add Is
Are domain sockets a viable approach for you to avoid port conflicts?
I was going to ask about Windows, but it seems that Windows now has domain sockets? According to
Some context and a thread to implement a version of this proposal: jupyter/jupyter_client#487. I shamefully did not find time to work on the implementation before going on paternity leave, but I think it has many of the details about why and how different scenarios could be handled.
I don't think we want to drop support for the TCP protocol. Besides, you may want to connect to a kernel from a distant client.
We want kernels to update to the new mechanism so that we can drop the old one in the long run.
The random port choice right now does not really work for distant clients - how common are those issues? I'm not saying that TCP should be dropped, just wondering if IPC or domain sockets can be used for your scenario to avoid the need for larger changes. We've been using IPC for this for quite a while now.
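For context on what that looks like in practice, switching a ZeroMQ socket from TCP to IPC (Unix domain sockets) is mostly a matter of the endpoint URL. A minimal sketch (the socket path is made up for illustration):

```python
import zmq

context = zmq.Context.instance()
shell = context.socket(zmq.ROUTER)

# TCP transport: a free port has to be picked, which is where the race lives.
shell.bind("tcp://127.0.0.1:54321")

# IPC transport (commented out): bound to a filesystem path instead of a port,
# so there is no port conflict; but it only works when the client and the
# kernel share a machine, which is exactly the limitation discussed below.
# shell.bind("ipc:///tmp/kernel-shell.sock")
```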
This is orthogonal to the distant-client question. I mean, you could start a kernel and a client on a machine A, and want to connect to that kernel from a client running on a machine B. In that case, you cannot use IPC sockets.
In the context of Voilà, we noticed up to 30% failure (i.e. notebooks not displayed).
Isn't there a missing field that should be added to the connection file?
How will the client know which kernel is communicating its ports? Do you need something like a
EG sends the kernel_id to the kernel and it is returned to handle interleaved launches/responses. I guess I still have a concern about using the connection file to convey the response address as it immediately eliminates remote kernels. I think the response address (and whatever else is required) needs to be an argument to the kernel - just as the connection filename is. Also, are there plans for encrypting the response?
How does it send it?
I'm not sure I understand why this eliminates remote kernels. If the response address is 192.131.21.12:7737, a remote kernel can still ping back to the server; or is there something I misunderstood? I'm also not sure what the difference is between the response address being part of the connection file and passing it as a kernel argument.
At this point I'm not sure we can encrypt, but we may at least want to authenticate.
As an explicit argument to the launcher: https://github.com/jupyter/enterprise_gateway/blob/master/etc/kernelspecs/spark_python_kubernetes/kernel.json#L21-L22
Currently, it sounds like the connection file will contain the response address, but the file is local - so the file's contents are not available to the remote kernel and, as a result, it won't know to respond to 192.131.21.12:7737 - all it knows is the name of the connection file (which isn't very helpful and won't exist).
EG also sends a public key (via another parameter) that is used by the kernel (launcher) to encrypt an AES key (to encrypt the response). On return, the AES key is decrypted using the private key and then the payload is decrypted using the AES key. I know this is all probably more than you want but I'm just sharing experiences (and trying to ensure our functionality can somehow continue to exist in some manner).
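For readers unfamiliar with that hybrid scheme, here is a rough sketch of the idea (this is not EG's actual code; it assumes the Python `cryptography` package, and the helper name and payload format are made up):

```python
import json
import os
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_response(public_key_pem: bytes, connection_info: dict) -> dict:
    """Encrypt the connection info with a fresh AES key, wrap the AES key with RSA."""
    public_key = serialization.load_pem_public_key(public_key_pem)

    # Symmetric encryption of the (possibly large) payload.
    aes_key = AESGCM.generate_key(bit_length=128)
    nonce = os.urandom(12)
    ciphertext = AESGCM(aes_key).encrypt(nonce, json.dumps(connection_info).encode(), None)

    # Asymmetric wrapping of the AES key so only the private-key holder can read it.
    wrapped_key = public_key.encrypt(
        aes_key,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None),
    )
    return {"key": wrapped_key.hex(), "nonce": nonce.hex(), "payload": ciphertext.hex()}
```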
Indeed, the client should add a kernel_token to the connection file so that it can identify which kernel is communicating its ports.
If you pass the response address as an argument to the kernel, and if that kernel is remote, you need to send the response address from the client to the remote machine that will launch the kernel. It's the same with the connection file: you need to send the connection file to the remote machine where it can be dumped. In both cases, you need an intermediate actor on the remote machine. Since the kernel command line already accepts a connection file, and since we cannot specify mutually exclusive arguments in the kernelspec, I think we should keep using the connection file.
Indeed, since the connection file contains the signature scheme and the key.
Thanks for sharing, this is much appreciated! I don't think this change will break your functionality (although there might be subtleties that I missed): the kernel is still started with a connection file, the only difference is that the kernel will communicate its ports back to the client. The connection file still has to be transferred from the client machine to the kernel machine if the kernel is remote (via a gateway or any other intermediate agent).
Not true. With EG we communicate with a resource manager (Kubernetes, Docker Swarm, Hadoop YARN, etc.) to start a launcher that embeds the kernel. The launcher will ultimately produce a connection file to be consumed by its associated kernel (and local to the kernel), but all interaction between the server (client) and the launcher (kernel) takes place via arguments, and the launcher code is responsible for encrypting and sending the response (thereby preventing any updates to the associated kernel). For the container-based resource managers, the arguments are passed via environment variables. No file transfers are necessary - particularly because it's not even known at the time of launch on which node the resource manager will decide to run the launcher/kernel. At any rate, thanks for working on this and listening.
What prevents you from using the mechanism you already have to pass the additional arguments (i.e. client address, dedicated socket port, and kernel token) to the launcher and having it dump the connection file with these new parameters?
That's not that big of a deal - although I was hoping (assuming I could move to this proposal) not to have to open the "connection file" to extract the pieces as arguments (and I believe it's better for the sake of maintainability to have them separate anyway). If I'm understanding things, here's what the launcher would then need to do.

I think it's imperative, really, that a) the response parameters be conveyed via a different argument than the connection file, and b) kernels updated to use this approach still retain today's behavior, where the connection file contains the actual connection information. That way, applications that have already addressed port conflict issues can still operate, and the items listed above will no longer apply.

I think at this point it would be best to meet online where we can probably communicate more interactively. I am willing to adjust for timezone differences - preferring mornings (Pacific TZ). I suspect it's very late for you right now and I appreciate you taking this extra time. If you want to discuss things tomorrow, just ping me in your afternoon. I'm usually at my desk by 7 am PST.

FWIW, I have a proposal (in my head for too long) similar to kernel providers but built into jupyter_client, where the Popen process interactions are abstracted. I think if that layer were responsible for handling the connection - making KernelManager really just an application-level entity - then we could easily support IPC, TCP (with handshake), TCP (remote), Kubernetes, etc. as part of the base Jupyter framework. This is essentially what EG does with process-proxies, but there's no reason this framework can't be made available to any jupyter_client application.
I think we should decompose the what to pass to the kernel and the how; I think we agree on the what.

A warning against CLI parameters: you do not want to pass security-sensitive items via command-line arguments, as those might be leaked in the process list to other users; hence you usually want an environment variable or a file. Many CVEs on the notebook server were due to opening Chrome with a URL containing a token, which was then leaked to other processes/users, even non-privileged ones.

Side note: a non-handshake-aware kernel could be wrapped in a mini handshake-aware proxy on the remote host to hide the details. You might have to jump through hoops, but it is doable.
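To make the environment-variable route concrete, here is a small sketch of what a launcher could do. The variable name `JUPYTER_REGISTRATION_INFO` and the kernel module are made up for illustration; this is not part of the proposal:

```python
import json
import os
import subprocess

# Hypothetical: hand the registration address and token to the kernel through
# the environment so they never appear in the process list (`ps`).
env = dict(os.environ)
env["JUPYTER_REGISTRATION_INFO"] = json.dumps({
    "registration_address": "tcp://192.131.21.12:7737",
    "kernel_token": "not-a-real-token",
})

# The kernel (or its launcher) would read the same variable on the other side:
#     info = json.loads(os.environ["JUPYTER_REGISTRATION_INFO"])
subprocess.Popen(["python", "-m", "some_kernel_module"], env=env)
```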
In my opinion, passing the handshake port in the connection file is the appropriate route.
In my opinion, @JohanMabille's proposal is the appropriate course of action to fix this issue. Note that it has to be fixed as the current kernel startup is inherently broken. We would be happy to connect to discuss this in greater detail.
This (the kernelspec not matching the installed kernel) looks like a deployment problem independent of this JEP. This mismatch could already exist, for instance, regarding the arguments that a kernel accepts.
What prevents you from having a proxy between the client and the kernel and passing the address of this proxy in the connection file?
I still don't understand why this would be problematic, since you already need to "transfer" the information somehow (transferring the connection file, or passing data to a resource manager and then generating the connection file from this data on the remote machine, is an implementation detail).
This is specific to jupyter_client, which is a Python implementation of a client in the Jupyter ecosystem. I think the specification should remain language-agnostic.
I agree. What's the best medium to ping you?
Matthias is correct: we should focus on the what instead of the how. I'm sorry this has diverged, but I'm just trying to make these two implementations (EG's and this proposal's) as compatible as possible.
Thanks. Please use the email address associated with my GH profile.
@kevin-bates I've just updated the JEP based on the discussion we had today and the feedback from @Carreau. Please let me know if I missed something.
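For concreteness, a handshake-style connection file along the lines discussed above might look roughly like the following. Field names such as `registration_port` and `kernel_token` are hypothetical placeholders, not the finalized format from the JEP:

```python
import json
import uuid

# Hypothetical handshake-style connection file: instead of the five channel
# ports, it carries the registration socket address plus authentication info.
connection_info = {
    "transport": "tcp",
    "ip": "127.0.0.1",
    "registration_port": 7737,          # where the kernel reports its ports (hypothetical field)
    "kernel_token": uuid.uuid4().hex,   # identifies which kernel is registering (hypothetical field)
    "signature_scheme": "hmac-sha256",
    "key": uuid.uuid4().hex,
}

with open("kernel-handshake.json", "w") as f:
    json.dump(connection_info, f, indent=2)
```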
LGTM - Thanks for your patience and collaboration Johan!
LGTM!
My only points left are:
- I think JEP for Optional Features #92 would be a better tool to indicate feature support, rather than making it unconditional on the protocol version number. But I'm okay without it, since it hasn't happened yet.
- It's still unclear to me if a JEP like this would need the level of detail of an actual spec update, that is:
- what does a connection file look like for a handshake startup?
- what do the registration request/reply messages look like, exactly?
I think it's clear what information should be there, and I think it's okay to save that level of detail to the implementations, but I'm not sure folks agree on that.
On the proposed text:

> ## Proposed Enhancement
>
> We propose to implement a handshaking pattern: the client lets the kernel find free ports and communicate them back via a dedicated socket. It then connects to the kernel. More formely:

Suggested change: "We propose to implement a handshaking pattern: the client lets the kernel find free ports and communicate them back via a dedicated socket. It then connects to the kernel. More formally:"
On the proposed text:

> - The kernel launcher is responsible for opening a dedicated socket for receiving connection information from kernels (channel ports). This socket will be refered as the registration socket.

Suggested change: "- The kernel launcher is responsible for opening a dedicated socket for receiving connection information from kernels (channel ports). This socket will be referred as the **registration socket**."
|
||
- The kernel launcher is responsible for opening a dedicated socket for receiving connection information from kernels (channel ports). This socket will be refered as the registration socket. | ||
- When starting a new kernel, the launcher passes the connection information for this socket to the kernel. | ||
- The kernel starts, find free ports to bind the shell, control, stdin, heartbeat and iopub sockets. It then connect to the registration socket and send the connection information to the client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The kernel starts, find free ports to bind the shell, control, stdin, heartbeat and iopub sockets. It then connect to the registration socket and send the connection information to the client. | |
- The kernel starts, finds free ports to bind the shell, control, stdin, heartbeat, and iopub sockets. It then connects to the registration socket and sends the connection information to the client. |
|
||
The way the launcher passes the connection information for the registration socket to the kernel should be similar to that of passing the ports of the kernel socket in the current connection pattern: a connection file that can be read by local kernels or sent over the network for remote kernels (although this requires a custom kernel provisioner or "nanny"). This connection file should also contain the signature scheme and the key. | ||
|
||
The kernel should not expect the registration socket to exist after it has sent its registration (i.e. it can be closed). Therefore, the kernel should disconnect from the registration socket right after it has sent its connection information. |
Should it wait for a response to acknowledge receipt? That could allow kernels to shut themselves down if the handshake never completes.
Indeed, I'll amend the JEP.
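Once the acknowledgment is added, the kernel-side flow could look roughly like this minimal sketch. It assumes pyzmq, a registration address taken from the connection file, and an illustrative JSON message format; none of this is the spec's final wording:

```python
import zmq

context = zmq.Context.instance()

# Bind each channel to an ephemeral port; the sockets stay open, so the ports
# cannot be taken by another process between discovery and use.
channels, ports = {}, {}
for name, sock_type in [("shell", zmq.ROUTER), ("control", zmq.ROUTER),
                        ("stdin", zmq.ROUTER), ("iopub", zmq.PUB),
                        ("hb", zmq.REP)]:
    sock = context.socket(sock_type)
    ports[f"{name}_port"] = sock.bind_to_random_port("tcp://127.0.0.1")
    channels[name] = sock

# Register with the launcher and wait for the acknowledgment before serving
# requests; if the ack never arrives, the kernel could shut itself down.
registration = context.socket(zmq.REQ)
registration.connect("tcp://127.0.0.1:7737")   # registration address from the connection file
registration.send_json({"kernel_token": "abc123", **ports})
ack = registration.recv_json()
registration.close()
```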
On the proposed text:

> ### Remarks
>
> This pattern is **NOT** a replacement for the current connection pattern. It is an additional one and kernels will have to implement both of them to be conformant to the Juyter Kernel Protocol specification. Which pattern should be used for the connection if decided by the kernel launcher, depending on the information passed in the initial connection file.

Suggested change: "This pattern is **NOT** a replacement for the current connection pattern. It is an additional one and kernels will have to implement both of them to be conformant to the Jupyter Kernel Protocol specification. Which pattern should be used for the connection is decided by the kernel launcher, depending on the information passed in the initial connection file."
I wish we had #92 to indicate support, rather than purely a protocol version number requiring both implementations.
On the proposed text:

> Although this enhancement requires changing all the existing kernels, the impact should be limited. Indeed, most of the kernels are based on the kernel wrapper approach, or on xeus.
>
> Most of the clients are based on `jupyter_client`. Therefore, the changes should only be limited to this repository or external kernel provisioners.

Suggested change: "Most of the clients are based on `jupyter_client`. Therefore, the changes should only be limited to this repository or external kernel provisioners and additional client libraries, e.g. in other languages."
I don't think we want this connection pattern to be optional since it's fixing a "bug" in the protocol. For me, optional features should be reserved for:
Besides, the Optional features (at least in the current proposition) are discovered through the
Fair enough.
I discussed this JEP with @JohanMabille during JupyterCon and I said I would free up time to go into more detail. I am planning to do that next week and will report findings ASAP.
Thank you, @echarles! I appreciate you taking the time to review this more closely. This is extremely valuable since you have more expertise than most in this area. ❤️ I think it's important to acknowledge that this PR has reached enough votes to be approved. I think we should still keep this vote open until after your review; unless you find a major glaring issue that would require us to revote, we can consider this approved (since it's been open for >2 years). That said, any questions, comments, or (small) concerns you have, Eric, it would be great to have them documented here so we can have a written record of things to consider during implementation. Assuming there are no major issues found, I propose we plan to merge this during next week's SSC meeting.
On the proposed text:

> We propose to implement a handshaking pattern: the client lets the kernel find free ports and communicate them back via a dedicated socket. It then connects to the kernel. More formally:
>
> - The kernel launcher is responsible for opening a dedicated socket for receiving connection information from kernels (channel ports). This socket will be referred as the **registration socket**.
@JohanMabille Would the port registration socket change on each kernel launch, or is it static and the socket remains the same for each kernel launch?
(in other words, do we have as many registration sockets as kernels, or only one registration socket for all the kernels)?
The two options are theoretically possible (thus the wording "The kernel should not expect the registration socket to exist after it has received the acknowledge receipt (i.e. it can be closed)" in the JEP), although I guess that in most of the implementations we would keep a single socket as long as possible.
The spec should be as clear as possible. If both options are possible, it should be stated clearly in words. Further questions:
- In your many-registration-sockets option, what is the added value of this JEP? The initial issue was the availability of the 5 ports. If a registration port needs to be found on each kernel launch, we don't win a lot and are still subject to failure.
- In your single-registration-socket option, what process is responsible for maintaining that socket? Would it be some code in jupyter-client, or will each of the "clients" (jupyter-server, jupyter-console...) be responsible for that? (I expect your answer to be "hey, this is an implementation detail...", I hope it will not be.)
> The spec should be as clear as possible. If both options are possible, it should be stated clearly in words.

This can be done in the PR to the doc of the protocol and in the protocol schema. Leaving both options open in the JEP should be enough.

> In your many-registration-sockets option, what is the added value of this JEP? The initial issue was the availability of the 5 ports. If a registration port needs to be found on each kernel launch, we don't win a lot and are still subject to failure.

The current pattern is:
1. The client finds free ports by opening sockets on them.
2. The client closes the sockets.
3. The client communicates these ports to the kernel.
4. The kernel tries to open sockets on these ports.

So the problem is that there is a delay between finding free ports and actually opening sockets for using them. With the handshake pattern, the situation is totally different, even in the "many registration sockets" option: the kernel finds the free ports by opening sockets on them and then does not close them (see the sketch below).
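A small sketch of that difference, assuming pyzmq and the Python standard library (illustrative only, not code from jupyter_client):

```python
import socket
import zmq

# Old pattern (steps 1-4 above): the port is free when probed, but the probe
# socket is closed before the kernel binds, leaving a race window.
def find_free_port_old_way() -> int:
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()            # anything can grab `port` after this point
    return port

# Handshake pattern: the kernel itself binds and keeps the socket open, so the
# port it later reports is the port it is actually listening on.
def bind_channel_new_way(context: zmq.Context, sock_type: int):
    sock = context.socket(sock_type)
    port = sock.bind_to_random_port("tcp://127.0.0.1")
    return sock, port        # the socket stays open; no race
```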
> In your single-registration-socket option, what process is responsible for maintaining that socket? Would it be some code in jupyter-client, or will each of the "clients" (jupyter-server, jupyter-console...) be responsible for that? (I expect your answer to be "hey, this is an implementation detail...", I hope it will not be.)

The process responsible for maintaining that socket is the process that is responsible for launching the kernel, whatever it is, since it is the one passing the registration socket info to the kernel upon launch.
I was about to write that registration socket lifetime should not be in the spec, as it's an implementation detail irrelevant to kernels, but then I thought about kernel restarts. Presumably restarting kernels need to re-register their (possibly new) sockets, so the registration socket should stay open (or reopen at the same port - not always available) IF kernel restarts are expected. For example:
- The client MAY close the registration socket after completing a kernel's registration.
- The client MAY use a single registration socket for multiple kernels.
- Restarting a kernel will require the registration socket again, so the client SHOULD keep the registration socket open while it expects restarts to be possible, OR open a new socket and pass the new registration socket URL to the new process.
Hey sorry, I missed this one. I will update the JEP accordingly.
On the proposed text:

> ### Remarks
>
> This pattern is **NOT** a replacement for the current connection pattern. It is an additional one and kernels will have to implement both of them to be conformant to the Jupyter Kernel Protocol specification. Which pattern should be used for the connection is decided by the kernel launcher, depending on the information passed in the initial connection file.
@JohanMabille Do we need to specify that launchers and kernels that do not both support both patterns can coexist, e.g. an old launcher (single-mode) should be able to launch new kernels (dual-mode)?
Similarly, how would a dual-mode launcher know if the kernel is single- or dual-mode? Is there a "capabilities" JEP for that? I guess it is mandatory for the launcher to know if the kernel supports the new handshaking pattern.
As stated in the JEP:

> The kernel specifies whether it supports the handshake pattern via the "kernel_protocol_version" field in the kernelspec:
> - if the field is missing, or if its value is less than 5.5, the kernel supports passing ports only.
> - if the field value is >= 5.5, the kernel supports both mechanisms.

So old launchers can start new kernels, and new launchers will still be able to start old kernels. Reading the kernelspec is enough to know whether a kernel supports both mechanisms or the older one only.
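A launcher-side sketch of that check (illustrative only; the helper name is made up, and the field name is the one quoted above):

```python
import json

def supports_handshake(kernelspec_path: str) -> bool:
    """Return True if the kernelspec declares a kernel protocol version >= 5.5."""
    with open(kernelspec_path) as f:
        spec = json.load(f)
    version = spec.get("kernel_protocol_version")
    if version is None:
        return False            # missing field: ports-only kernel
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= (5, 5)
```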
I just had another thought - how does this interact with things like
Good catch, I think we discussed this at some point, but this is definitely missing in the current wording of the JEP, I'll update it.
This gives me -3, hence my negative vote on this JEP.
On the proposed text:

> - The kernel launcher is responsible for opening a dedicated socket for receiving connection information from kernels (channel ports). This socket will be referred as the **registration socket**.
> - When starting a new kernel, the launcher passes the connection information for this socket to the kernel.
> - The kernel starts, finds free ports to bind the shell, control, stdin, heartbeat and iopub sockets. It then connects to the registration socket and sends the connection information to the registration socket.

Suggested change: "- The kernel starts, finds free ports to bind the shell, control, stdin, heartbeat and iopub sockets. The connection ports information will be referred as the **connection information**. It then connects to the registration socket and sends the connection information to the kernel launcher through that socket."
Should we formally define the socket message format for the ports information? Or is it irrelevant to a JEP?
Due to the length of the discussion here, it has been decided to discuss the format of the messages in the PR updating the protocol (similarly to what was done when adding the debugger, where we described the global approach in the JEP, and discussed the details in the PR to the protocol).
I joined the SSC call yesterday and we discussed this issue. I will try to summarise my point of view, as most of the US participants were on holiday yesterday (this is just my perspective and does not represent any consensus, as we have not found any).
I hope this gives a bit more perspective on my negative vote.
The vote is now closed with the results: In favor: 7 (Itay, Paul, Johan, Min, Rick, Sylvain, Zach). In light of those results, this JEP is accepted.
Kernel Handshaking pattern
The current implementation of Jupyter client makes it responsible for finding available ports and passing them to a newly starting kernel. The issue is that a new process can start using one of these ports before the kernel has started, resulting in a ZMQError when the kernel starts. This is even more problematic when spawning a lot of kernels in a short period of time, because the client may find available ports that have already been assigned to another kernel.
A workaround has been implemented for the latter case, but it does not solve the former one.
This proposal aims to fix this issue for good, with a handshaking pattern.
Voting from @jupyter/software-steering-council